Abstract

Many tasks in natural language processing, ranging from machine translation to question answering, can be reduced to the problem of matching two sentences or, more generally, two short texts. We propose a new approach to the problem, called Deep Match Tree (DEEPMATCH_tree), under a general setting. The approach consists of two components: 1) a mining algorithm to discover patterns for matching two short-texts, defined in the product space of dependency trees, and 2) a deep neural network for matching short texts using the mined patterns, as well as a learning algorithm to build the network with a sparse structure. We test our algorithm on the problem of matching a tweet and a response in social media, a hard matching problem proposed in [Wang et al., 2013], and show that DEEPMATCH_tree can outperform a number of competitor models, including one that does not use dependency trees and one based on word embedding, all by large margins.


1 Introduction 

Matching is of central importance to natural language processing. In fact, many problems in natural language processing can be formalized as matching between two short-texts, with different matching relations in different applications. For example, in paraphrase identification the relation is synonymy, and in information retrieval it is relevance. At the same time, matching is a challenging problem, since it requires modeling the two short-texts as well as their relation. In machine translation, for example, the model needs to determine whether a sentence in the source language has the same meaning as a sentence in the target language. In dialogue, the model needs to judge whether a message is an appropriate response to a given utterance.

Deep neural networks can model non-linear and hierarchical relations [Bengio, 2009], and are thus well suited for short-text matching in natural language processing. The very limited work in that thread makes use of word embedding as the building block of the matching model. Although embedding-based methods have proven effective on tasks like question answering [Lu and Li, 2013], paraphrase identification [Socher et al., 2011], and even short-text conversation [Lu and Li, 2013; Hu et al., 2014], they are not good enough at handling the subtlety of general short-text matching. Short-texts often carry rich content, their relations are complicated, and more sophisticated structures are required for comparing the two short-texts. For example, when judging the appropriateness of the response "You should rest more" to the utterance "I have to work during the weekend!", we have to consider the semantic correspondence between "work over the weekend" and "need to rest more", which is hard to capture with an embedding-based model.

We study the problem of short-text matching in a general setting. Our method, named Deep Match Tree (DEEPMATCH_tree), consists of two sequentially connected components: 1) a mining algorithm to discover rich yet subtle patterns, defined in the product space of dependency trees, from a large corpus of paired short-texts, and 2) a learning algorithm to construct a deep neural network (DNN) for making a matching decision on the two short-texts on the basis of the mined patterns. The DNN model is trained with contrastive sampling of negative examples.

Without loss of generality, we focus on the task of matching a response to a given tweet on Weibo, a popular Chinese microblog service, for which a large amount of data is available. This is a hard problem, requiring consideration of the complicated correspondence between the structures of two texts. Our experimental results show that DEEPMATCH_tree is superior to existing methods on this problem.

Our main contributions are: 1) an algorithm for mining dependency tree matching patterns on a large scale, 2) an algorithm for learning a deep matching model that uses the mined matching patterns, and 3) empirical validation of the efficacy and efficiency of the proposed method on large-scale real datasets.

2 Direct Product of Graphs (PoG) 


* This work was done when the first author worked as an intern at Noah's Ark Lab, Huawei Technologies.


We first propose representing the matching of a pair of sentences (in general, short-texts) with the direct product of their dependency trees, and then propose treating sub-graphs of this product graph as matching patterns.

Figure 1: The overall architecture of DEEPMATCH_tree.


2.1 Dependency Tree 

We represent a sentence with its dependency tree. We choose to do so because a dependency tree tends to expose the "skeleton" of the sentence, revealing both short-distance and long-distance grammatical relations between words [Filippova and Strube, 2008]. For example, the dependency tree in Fig. 2 contains structures like {Li-Na ← win → championship}, represented as a sub-tree, where the words are not necessarily adjacent to each other in the sentence.


Last week Li-Na won her sixteenth championship in her career 



Figure 2: Example of a dependency tree, where the main structure of the sentence is represented as the sub-tree in thick edges. The tweet is in Chinese (a literal English translation is shown).


2.2 Direct Product of Graphs 

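The direct product of graphs (PoG) $G_X = \{V_X, E_X\}$ and $G_Y = \{V_Y, E_Y\}$ is a graph $G_{X \times Y}$ [Vishwanathan et al., 2010], with vertices $V_{X \times Y}$ and edges $E_{X \times Y}$:

\[ V_{X \times Y} = \{ (v_x^i, v_y^j) : v_x^i \in V_X,\ v_y^j \in V_Y \} \]

\[ E_{X \times Y} = \{ ((v_x^i, v_y^j), (v_x^{i'}, v_y^{j'})) : (v_x^i, v_x^{i'}) \in E_X \wedge (v_y^j, v_y^{j'}) \in E_Y \} \]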


Given two sentences $S_X$ and $S_Y$, their interaction relation is represented by the direct product of their dependency trees, $G_{X \times Y}$. For example, for the two sentences "Worked all night" and "Have a good rest", the dependency trees are given in the left panel of Fig. 3 and the direct product of the trees in the right panel. Note that $G_{X \times Y}$ is in general a graph even though $G_X$ and $G_Y$ are trees.

Figure 3: The direct product of two dependency trees.

$G_{X \times Y}$ directly describes the interaction relation between sentences $S_X$ and $S_Y$, hosting a rather rich set of structures, both lexical and syntactic, that contribute to the overall matching between the two sentences. Next we make a further abstraction of the representation.
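The definition above can be made concrete with a small sketch. The edge lists below are assumed dependency edges (head, dependent) for the two example sentences, not the output of any particular parser, and `direct_product` is an illustrative helper name.

```python
from itertools import product


def direct_product(edges_x, edges_y):
    """Direct product of two graphs given as lists of (head, dependent) edges."""
    nodes_x = {w for edge in edges_x for w in edge}
    nodes_y = {w for edge in edges_y for w in edge}
    # V_{XxY}: every pairing of a vertex from G_X with a vertex from G_Y.
    vertices = set(product(nodes_x, nodes_y))
    # E_{XxY}: one product edge for every (edge in E_X, edge in E_Y) pair.
    edges = {((hx, hy), (dx, dy))
             for hx, dx in edges_x
             for hy, dy in edges_y}
    return vertices, edges


# Assumed dependency edges for "Worked all night" and "Have a good rest".
tree_x = [("worked", "night"), ("night", "all")]
tree_y = [("have", "rest"), ("rest", "a"), ("rest", "good")]

vertices, edges = direct_product(tree_x, tree_y)
print(len(vertices), len(edges))
```

With three words on the X side and four on the Y side, the product has 3 x 4 = 12 vertices and 2 x 3 = 6 edges.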


2.3 Abstraction 


We consider two types of abstraction for vertices in $G_{X \times Y}$:

• Same Entity: We replace a vertex $(ne_i, ne_j)$ in $G_{X \times Y}$ whose two components refer to the same entity with a general vertex SAMEENTITY. For example, for the sentence pair (How is the weather in Paris?) and (Haven't seen such a sunny day in Paris for a while!), the vertex (Paris, Paris) after abstraction is treated as the same vertex as (Boston, Boston) after the same type of abstraction. The graph with this type of abstraction is named $\bar{G}_{X \times Y}$.

• Similar Word: We cluster words based on their word2vec vectors [Mikolov et al., 2013] using the K-means algorithm. For a vertex $(w_i, w_j)$ in the product graph, if $w_i$ and $w_j$ belong to the same word cluster $C_k$, the vertex is replaced with a new vertex SIMWORD_k. The graph with this type of abstraction is named $G'_{X \times Y}$.


























Both types of abstraction enhance the generalization ability of the matching pattern mining described next.
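The two abstractions can be illustrated with a short sketch. The entity set and the cluster ids below are made-up stand-ins for the output of an entity tagger and of K-means over word vectors, and the function names are illustrative only.

```python
def same_entity(vertex, entities):
    """SameEntity abstraction: (e, e) becomes one generic vertex for any entity e."""
    wx, wy = vertex
    return "SameEntity" if wx == wy and wx in entities else vertex


def similar_word(vertex, cluster_of):
    """SimilarWord abstraction: (w_i, w_j) becomes SimWord_k if both lie in cluster k."""
    wx, wy = vertex
    cx, cy = cluster_of.get(wx), cluster_of.get(wy)
    return f"SimWord_{cx}" if cx is not None and cx == cy else vertex


entities = {"Paris", "Boston"}                  # assumed entity tagger output
cluster_of = {"good": 3, "nice": 3, "rest": 8}  # assumed K-means cluster ids

# (Paris, Paris) and (Boston, Boston) collapse to the same abstract vertex.
print(same_entity(("Paris", "Paris"), entities),
      same_entity(("Boston", "Boston"), entities))
# Words from the same cluster collapse; other vertices are left untouched.
print(similar_word(("good", "nice"), cluster_of),
      similar_word(("weather", "sunny"), cluster_of))
```

Because many concrete vertices map to one abstract vertex, patterns that are individually too rare can accumulate enough support to be mined.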

2.4 Sub-graphs of PoG as Matching Patterns 

With a slight abuse of notation, we use $G_{X \times Y} = \{G_{X \times Y}, G'_{X \times Y}, \bar{G}_{X \times Y}\}$ to denote the PoG for the sentence pair $(S_X, S_Y)$ as well as its variants after the two types of abstraction. For a sentence pair $(S_X, S_Y)$, any sub-graph of the corresponding $G_{X \times Y}$ describes part of the interaction between the two sentences and can therefore contribute to the matching between the two. For instance, (weather, sunny) ↔ SAMEENTITY is a sub-graph describing the matching between two sentences in a conversation about weather (see the Paris example in Section 2.3). In general, $G_{X \times Y}$ contains all the meaningful matching patterns for the task.


3 Mining of Matching Patterns 

It is the responsibility of a mining algorithm to discover those sub-graphs of $\{G_{X \times Y}\}$ that can work as matching patterns to discriminate matched sentence pairs from mismatched ones, measured in terms of discriminative ability (cf. [Fan et al., 2008]). Roughly, a pattern is discriminative if it gives some evidence of matching, i.e., it appears in matched pairs more frequently than in unmatched pairs. An efficient mining algorithm is vital to the success of this method when the number of instances is of the order of $10^6$ and the number of mined patterns is of the order of $10^7$.


3.1 Speeding-up the Mining Process 

Fortunately, we can leverage the following fact about sub-graphs of the PoG $G_{X \times Y}$ (without abstraction).

Proposition 3.1. Any connected sub-graph of $G_{X \times Y}$ uniquely determines a minimal sub-tree in $G_X$ and a minimal sub-tree in $G_Y$ whose direct product covers that sub-graph.

As the proposition implies, the mining of sub-graphs in the PoG of trees can be reduced to jointly selecting sub-trees on the two sides. This not only greatly speeds up the mining process, but also avoids finding patterns with duplicate functionality for matching. In the remainder of the paper, we use sub-tree_1 ⊗ sub-tree_2 to denote a tree pair (separated by ⊗) mined from the PoG. This, however, does not apply to the more general case of $G_{X \times Y}$ in which some vertices are replaced with non-factorable variants, like SIMWORD_123, for which we have to introduce some new tricks.


3.2 Mining without Abstraction 

The algorithm for mining without abstraction, sketched in Algorithm 1, recursively grows the mined sub-graphs while maintaining their discriminative ability. It starts with the simplest pattern (1, 1), standing for a one-word tree on both the X side (tweet) and the Y side (response), and grows the mined trees recursively. In each growing step (LeftExtend() and RightExtend()), the size of the sub-trees is increased by one on either the X side or the Y side, followed by a filtering step that removes the found pairs with discriminative ability below a threshold. The growing step is efficient, since we can limit the search for patterns of size (m, n + 1) to the candidates formed by merging patterns of size (m, n). In practice, the time for looking up each sub-tree pair is almost constant with the help of a hash map. The following table gives some examples of the matching patterns discovered by Algorithm 1.


Algorithm 1: Discriminative Mining of Parse Trees for Parallel Texts

Input: T: tree pairs for original (tweet, response), MaxSize
Output: Set of mined features F

Initialize F ← ∅; M ← ∅; Q ← [ ];
ENQUEUE(Q, (1, 1));
foreach node set x, y in X, Y do
    Append each element of x ⊗ y to F_{1,1};
    Append x ⊗ y to M_{1,1};
F_{1,1} ← DiscriminativeFilter(F_{1,1});
while Q ≠ [ ] do
    m, n ← DEQUEUE(Q);
    if m + 1 < MaxSize ∧ (m + 1, n) has not been processed then
        [M_{m+1,n}, F_{m+1,n}] ← LeftExtend(M_{m,n});
        ENQUEUE(Q, (m + 1, n));
        F ← F ∪ F_{m+1,n};
    if n + 1 < MaxSize ∧ (m, n + 1) has not been processed then
        [M_{m,n+1}, F_{m,n+1}] ← RightExtend(M_{m,n});
        ENQUEUE(Q, (m, n + 1));
        F ← F ∪ F_{m,n+1}.


Patterns without abstraction

exam ⊗ score
Information theory ⊗ Shannon
thank → present ⊗ happy → birthday
win → game ⊗ trying → keep
out-of-control → prices ⊗ regulation
work → weekend ⊗ rest
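The control flow of Algorithm 1 can be sketched in Python as follows. This is only a skeleton under stated assumptions: the extension functions and the filter `keep` are caller-supplied placeholders (the paper's actual sub-tree growth and discriminative measure are not reproduced here), and the toy patterns at the bottom exist purely to exercise the queue.

```python
from collections import deque


def mine(seed, left_extend, right_extend, keep, max_size):
    """Grow pattern sizes (m, n) breadth-first, as in Algorithm 1.

    seed:         candidate patterns of size (1, 1)
    left_extend:  candidates of size (m, n) -> candidates of size (m + 1, n)
    right_extend: candidates of size (m, n) -> candidates of size (m, n + 1)
    keep:         discriminative filter, pattern -> bool (hypothetical)
    """
    matches = {(1, 1): list(seed)}                       # M_{m,n}
    features = [p for p in matches[(1, 1)] if keep(p)]   # F, seeded with F_{1,1}
    queue = deque([(1, 1)])
    while queue:
        m, n = queue.popleft()
        steps = []
        if m + 1 < max_size:
            steps.append(((m + 1, n), left_extend))
        if n + 1 < max_size:
            steps.append(((m, n + 1), right_extend))
        for size, extend in steps:
            if size in matches:                          # already processed
                continue
            matches[size] = extend(matches[(m, n)])
            features.extend(p for p in matches[size] if keep(p))
            queue.append(size)
    return features, sorted(matches)


# Toy stand-ins: a pattern is (x_words, y_words); extension appends a dummy word.
def grow_x(patterns):
    return [(x + ("x",), y) for x, y in patterns]


def grow_y(patterns):
    return [(x, y + ("y",)) for x, y in patterns]


features, sizes = mine([(("work",), ("rest",))], grow_x, grow_y,
                       keep=lambda p: True, max_size=3)
print(sizes)
```

Each size (m, n) is expanded exactly once, whichever neighbor reaches it first, which is what the "has not been processed" test in Algorithm 1 guarantees.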


3.3 Mining with Abstraction 

The algorithm for mining with abstraction is a variant of Algorithm 1. Taking the SAMEENTITY abstraction as an example, we first replace each named entity, e.g., Li Na (found via a named entity resolution program), with a vertex having the same ID (say, NamedEntity239). The growing step is the same as in Algorithm 1, except that when counting the support of a pattern (the number of instances containing it), the algorithm replaces the same entity appearing on both sides with a wildcard, and therefore groups many patterns into one. For example, the instances for the following two patterns

Li Na ← win ⊗ Li Na ← congratulations
Nadal ← win ⊗ Nadal ← congratulations

will be counted together for the pattern

X ← win ⊗ X ← congratulations

where X stands for the wildcard. Mining with the SIMILARWORD abstraction is similar, only slightly more complicated in deciding when two words can be merged.
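The effect of the wildcard on support counting can be shown with a small sketch. Patterns are flattened here to a pair of word tuples (edge directions are ignored for brevity), and the instance list and entity set are invented for illustration.

```python
from collections import Counter


def wildcard(pattern, entities):
    """Replace an entity that occurs on both sides of a pattern with 'X'."""
    left, right = pattern
    shared = set(left) & set(right) & entities

    def swap(side):
        return tuple("X" if word in shared else word for word in side)

    return swap(left), swap(right)


# Hypothetical mined instances: (X-side words, Y-side words).
instances = [
    (("Li Na", "win"), ("Li Na", "congratulations")),
    (("Nadal", "win"), ("Nadal", "congratulations")),
    (("Nadal", "win"), ("Nadal", "congratulations")),
]
support = Counter(wildcard(p, {"Li Na", "Nadal"}) for p in instances)
print(support)
```

All three instances are counted under the single abstract pattern, so its support is the sum of the supports of the concrete patterns it covers.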
Figure 4: Illustration of dependency trees for three short sentences.


The following table gives some examples of the matching patterns discovered by our algorithm on the graph with abstraction. Here (X, X') stand for wildcards considered similar enough by the algorithm.

Patterns with abstraction

hope → win → X ⊗ support → X
how about → X ⊗ like → X
gift → X ⊗ happy → X
recommend → X ⊗ X → nice
pretty good → X ⊗ fine → also → X'

3.4 Advantage of Tree Pattern Mining 

It is important to note that dependency tree matching patterns can provide better correspondence between the two sentences than word co-occurrences on the two sides (an idea first explored in [Lu and Li, 2013]). To illustrate the advantage of dependency tree matching patterns, suppose that for the tweet T1 in Fig. 4 we want to pick the more appropriate of the two responses (R1_A and R1_B). The word-based model tends to assign a high matching score to the pair (T1, R1_B), due to the pattern {Beijing, travel} ⊗ {Great Wall}, which is however spurious since T1 is about traveling to New York while the word Beijing is a distractor. On the other hand, the tree-based model relies more on patterns like the following:

travel → New-York ⊗ Times Square
travel → Beijing ⊗ Great-Wall

It thus discriminates between a word co-occurrence (e.g., {Beijing, travel}) and a dependency-tree pattern (e.g., travel → New-York), and gives a higher score to (T1, R1_A).

The mining algorithm allows us to find patterns representing deep and long-distance relationships within the two short-texts to be matched. The deep features therefore provide sophisticated matching structures between two texts. In contrast, shallow features can only give word-level correspondences between the two texts. The difference is analogous to that between a syntax-based translation model and a word-based translation model [Koehn et al., 2003].


4 The Deep Matching Model 

The dependency tree matching patterns (or deep features) are 
then incorporated into a deep neural network for determining 
the matching degree of a pair of short-texts. 


4.1 Model Description 

The diagram of our deep matching model is given in Fig. 1. When a pair of short-texts is given, we first obtain their dependency trees, form their direct product, and then perform abstraction on it (if suitable). After that, we look up the table of dependency tree matching patterns and convert the input text pair into a binary vector, in which an element is one if the corresponding pattern applies to the input text pair and zero otherwise. The binary vector, which has 10M dimensions and is sparse, with typically 10 to 50 ones in our experiments, is then fed into the deep neural network for the final matching decision.
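As a rough illustration of this lookup step, the sketch below encodes a text pair as the list of indices of its active features, which is a natural way to store a vector with only tens of ones among millions of entries. The pattern strings and the tiny table are hypothetical.

```python
def encode(hit_patterns, pattern_index):
    """Indices of the features that are one; all other entries are zero."""
    return sorted({pattern_index[p] for p in hit_patterns if p in pattern_index})


# Hypothetical slice of the pattern table (the real one has about 10M entries).
pattern_index = {
    "work->weekend (x) rest": 0,
    "exam (x) score": 1,
    "thank->present (x) happy->birthday": 2,
}

# Patterns found in one text pair; the second one is not in the table.
hits = ["work->weekend (x) rest", "work (x) tired", "exam (x) score"]
active = encode(hits, pattern_index)

# Dense view, only sensible for this toy table.
dense = [1 if i in active else 0 for i in range(len(pattern_index))]
print(active, dense)
```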

4.2 Learning 

The learning of the deep neural network consists of 1) learning of the architecture, and 2) tuning of the parameters.

Architecture Learning 

Since there are 10M raw features, the number of parameters would be too large if the input layer were fully connected to a first hidden layer of reasonable size (say, 1,000 nodes). It is therefore necessary to specify sensible sparse connection patterns to ensure that the information in the raw features is well abstracted in the first hidden layer.

It is believed that neural networks are better suited for dense and continuous input, and there is little work on building an appropriate architecture for sparse and discrete input of a demanding size. In this work, we take a simple procedure to ensure that each input node is connected to approximately K hidden nodes (K is referred to as NodeDensity later in the paper), and that the average activations of the hidden nodes (measured as the average number of times they are connected to hit features) are approximately the same. The underlying belief is that we can preserve as much information as possible when going from the sparse hit patterns to the dense 1,000-D representation.
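The paper does not spell out the procedure, so the following is only one plausible sketch of such a balancing scheme, not the authors' method: features are processed from most to least frequent, and each is wired to the K hidden nodes that currently carry the least accumulated hit frequency. The frequencies and sizes are toy values.

```python
import heapq


def assign_connections(feature_freq, n_hidden, k):
    """Greedy wiring: connect each feature to the k least-loaded hidden nodes."""
    load = [0.0] * n_hidden          # accumulated hit frequency per hidden node
    connections = {}
    for feat in sorted(feature_freq, key=feature_freq.get, reverse=True):
        targets = heapq.nsmallest(k, range(n_hidden), key=load.__getitem__)
        connections[feat] = targets
        for h in targets:
            load[h] += feature_freq[feat]
    return connections, load


# Toy hit counts for 8 features, wired to 4 hidden nodes with NodeDensity 2.
freq = {0: 8, 1: 4, 2: 4, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1}
connections, load = assign_connections(freq, n_hidden=4, k=2)
print(connections)
print(load)
```

Every input ends up with exactly K connections, and the gap between the most and least loaded hidden node never exceeds the largest single feature frequency.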


The Selection of Overall Architecture. The overall architecture of the neural network is illustrated in Fig. 1. As it shows, we have 1,000 units in the first hidden layer (sigmoid activation function), 400 in the second hidden layer, 30 in the third hidden layer, and one in the output layer. Empirical results show that this architecture performs slightly better than a 3-layer one with approximately the same number of parameters, while more hidden layers (say, 5) do not bring any significant further improvement.
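A quick count shows why the sparse first layer matters. The sketch below counts weights only (biases are ignored, which is an assumption of this example) for the layer sizes given above, once with each input connected to K = 10 hidden nodes and once with a fully connected first layer.

```python
def count_weights(n_input, node_density, hidden_sizes):
    """Number of weights, ignoring biases."""
    first = n_input * node_density          # sparse input-to-hidden connections
    rest = sum(a * b for a, b in zip(hidden_sizes, hidden_sizes[1:]))
    return first + rest


layers = [1000, 400, 30, 1]
sparse = count_weights(10_000_000, 10, layers)         # NodeDensity = 10
dense = count_weights(10_000_000, layers[0], layers)   # fully connected
print(sparse, dense)
```

With NodeDensity 10 the model has about 10^8 weights, roughly a hundred times fewer than the 10^10 of the fully connected alternative; nearly all of them sit in the first layer.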

4.3 Parameter Learning 

We employ a discriminative training strategy with a large-margin objective. Suppose that we are given triples $(x, y^+, y^-)$ from the oracle, with $x \in \mathcal{X}$ matched with $y^+$ better than with $y^-$ (both $\in \mathcal{Y}$). We use the following pairwise loss as the objective:

\[ \mathcal{L}(\mathcal{W}, \mathcal{D}_{trn}) = \sum_{(x_i, y_i^+, y_i^-) \in \mathcal{D}_{trn}} e_{\mathcal{W}}(x_i, y_i^+, y_i^-) + R(\mathcal{W}), \]

where $R(\mathcal{W})$ is the regularization term and $e_{\mathcal{W}}(x_i, y_i^+, y_i^-)$ is the error for the triple $(x_i, y_i^+, y_i^-)$, given by the following large-margin form:

\[ e_{\mathcal{W}}(x_i, y_i^+, y_i^-) = \max(0,\ m + s(x_i, y_i^-) - s(x_i, y_i^+)), \]

with $m$ ($0 < m$) controlling the margin in training and $s(x, y)$ denoting the matching score the network assigns to the pair $(x, y)$. In the experiments we use $m = 1$, but we find that the results are rather stable with $m$ in a fairly large range.
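The large-margin error is easy to check numerically. In the sketch below the scores are plain numbers standing in for the network outputs s(x, y+) and s(x, y-), and the regularizer is passed in as a precomputed value; both are simplifications for illustration.

```python
def triple_loss(s_pos, s_neg, margin=1.0):
    """e_W = max(0, m + s(x, y-) - s(x, y+))."""
    return max(0.0, margin + s_neg - s_pos)


def objective(score_pairs, reg=0.0, margin=1.0):
    """Sum of triple errors over (s_pos, s_neg) pairs plus a regularization value."""
    return sum(triple_loss(p, n, margin) for p, n in score_pairs) + reg


# A triple already separated by more than the margin costs nothing;
# one where the negative scores higher is penalized.
print(triple_loss(2.0, 0.5))
print(triple_loss(0.25, 0.5))
print(objective([(2.0, 0.5), (0.25, 0.5)]))
```

The loss is zero once the positive response beats the negative one by at least the margin m, so training effort concentrates on triples that are still ranked wrongly or too closely.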

For training, we use the generic back-propagation algorithm adapted to the sparse patterns in the first layer. More specifically, when updating the weights in the first layer, we only update the weights associated with the active nodes in the input layer. This is still exact back-propagation, since the gradient for weights attached to inactive inputs is zero, yet it makes learning efficient enough even on a training set with millions of instances. It is easy to see that the number of parameters (at least $4 \times 10^7$) is greater than the number of positive instances, and thus some kind of regularization is needed. Here we employ both dropout [Hinton et al., 2012] and early stopping [Caruana et al., 2001], which turn out to be important for the success of the model, especially when the number of parameters is over $10^8$.
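The sparse update can be sketched in a few lines. The nested-dictionary weight layout and the name `sgd_step_first_layer` are choices made for this example, not the authors' implementation; `delta` stands for the back-propagated error at each hidden node's pre-activation.

```python
def sgd_step_first_layer(weights, active, delta, lr):
    """Update only the first-layer weights attached to active (value-one) inputs.

    weights: {input_index: {hidden_index: weight}}, the sparse connections
    active:  indices of the inputs equal to one for this example
    delta:   {hidden_index: dLoss/dz_h}, error at each hidden pre-activation
    """
    for i in active:
        row = weights[i]
        for h in row:
            # dLoss/dw_ih = delta_h * x_i, and x_i = 1 for active inputs.
            row[h] -= lr * delta.get(h, 0.0)


weights = {0: {0: 0.5, 1: -0.5}, 1: {1: 1.0}, 2: {0: 2.0}}
sgd_step_first_layer(weights, active=[0, 2], delta={0: 1.0, 1: -2.0}, lr=0.5)
print(weights)
```

Input 1 is inactive, so its weights are untouched. With 10 to 50 active inputs per example, a step costs a few hundred weight updates rather than millions.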


5 Experiments 

We report our empirical study of DEEPMATCH_tree and compare it to competitors, with a brief analysis and case studies.

5.1 Datasets and Evaluation Metric 

The experiments are on two Weibo datasets in two settings. 


Original-vs-Random: The first dataset, denoted as DataOriginal, consists of 4.8 million (tweet, response) pairs. For each positive pair (original pair), we randomly select ten responses as negative examples (contrastive sampling of negative examples), rendering 45 million triples. Our evaluation shows that for a given tweet there is a <1% chance that a randomly selected response out of 10 is suitable.

We use 485,282 original (tweet, response) pairs not used in training for testing. For each pair, we draw nine random responses and test how well each matching model picks the correct response.
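Contrastive sampling of negatives can be sketched as follows. Drawing negatives uniformly from the pool of all responses and skipping the tweet's own response are assumptions of this example; the function requires at least two distinct responses in the pool.

```python
import random


def contrastive_triples(pairs, n_neg, seed=0):
    """Build (tweet, positive, negative) triples with randomly drawn negatives."""
    rng = random.Random(seed)
    responses = [response for _, response in pairs]
    triples = []
    for tweet, pos in pairs:
        for _ in range(n_neg):
            neg = rng.choice(responses)
            while neg == pos:            # assumed: never use the true response
                neg = rng.choice(responses)
            triples.append((tweet, pos, neg))
    return triples


pairs = [("t1", "r1"), ("t2", "r2"), ("t3", "r3")]
triples = contrastive_triples(pairs, n_neg=10)
print(len(triples), triples[0])
```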


Retrieval-based Conversation: The second dataset, denoted as DataLabeled, consists of 422 tweets and around 30 labeled responses for each tweet¹, as introduced in [Wang et al., 2013] for retrieval-based conversation.


On DataLabeled, we test how different matching models enhance the performance of the retrieval-based conversation model [Wang et al., 2013] in finding a suitable response for a given tweet. This is rather hard, since the negative responses are topically related to the tweet. We use the same retrieval strategy as in [Wang et al., 2013], and add the score of each matching model in turn as a new feature of the ranking function used to rank the retrieved responses (20 to 30 for each tweet).


In both experiments we use precision at one (P@1) [Li, 2011] to measure the accuracy of matching. For each given tweet T, we calculate the matching scores between T and all candidate responses, and select the one with the highest score. The ranking gets one point iff the selected response is the original one (on the Original-vs-Random dataset) or is labeled as "good" (on the Retrieval-based Conversation dataset). P@1 measures the chance of getting the selection right, averaged over all the tweets in the test set.
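The metric can be written down directly. In this sketch the word-overlap scorer is a toy stand-in for a matching model, and the test cases are invented.

```python
def precision_at_1(test_cases, score):
    """Fraction of tweets whose top-scored candidate is an acceptable response.

    test_cases: list of (tweet, candidate_responses, set_of_good_responses)
    score:      function (tweet, response) -> matching score
    """
    hits = 0
    for tweet, candidates, good in test_cases:
        best = max(candidates, key=lambda r: score(tweet, r))
        hits += best in good
    return hits / len(test_cases)


def overlap(tweet, response):
    """Toy scorer: number of shared words."""
    return len(set(tweet.split()) & set(response.split()))


cases = [
    ("a b", ["a b c", "d"], {"a b c"}),   # top candidate is good
    ("x", ["y", "x z"], {"y"}),           # top candidate "x z" is not good
]
print(precision_at_1(cases, overlap))
```

Here one of the two selections is right, so P@1 is 0.5.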


5.2 Competitor Methods 

• TRANSLATION: We use the translation probability p(response|tweet) to measure the matching level between the response and the tweet², estimated with a variant of IBM Model 1 [Brown et al., 1993] adapted for this task;

• COSSIM: We simply calculate the cosine similarity between two short-texts with their TF-IDF representations. This method is still better than random, since a good response tends to share words with the original tweet;

• WORDEMBED: We represent each short-text as the sum of the embedding vectors of the words it contains. The matching score of two short-texts is calculated using a multi-layer perceptron (MLP) with the concatenation of the two vectors as input;

• DEEPMATCH_topic: We employ the matching model in [Lu and Li, 2013] on the basis of topics and train a neural network with 3 hidden layers and 1,000 hidden nodes in the first hidden layer;

• DEEPMATCH_CNN: We use the matching model proposed in [Hu et al., 2014], represented as a convolutional neural network (CNN);

• LR_tree: To show the power of the mined patterns, we also train a logistic regression model taking all the mined patterns as input, with the contrastive sampling training strategy. This can be viewed as a shallow version of DEEPMATCH_tree.


The methods can be roughly categorized into pattern-based methods (COSSIM, TRANSLATION, LR_tree, and DEEPMATCH_tree) and embedding-based methods (WORDEMBED, DEEPMATCH_topic, and DEEPMATCH_CNN), where embedding-based methods represent each word with a vector, based on which the final matching decision is made.


¹ Data: data.noahlab.com.hk/conversation/

² This performs slightly better than p(tweet|response).





































All non-convex models are trained with stochastic gradient descent (SGD) [Le, 2013]. We find that their performance is in general quite insensitive to the mini-batch size.


5.3 Results on Original-vs-Random 

In this section we present the results in the Original-vs-Random setting. For each model, we only report its best performance on the test data, since the large size of the test data removes any chance of "accidental cheating". We first study the architecture variations of DEEPMATCH_tree, and then compare its best setting against the competitors.

Here we compare the performance of DEEPMATCH_tree under different settings, more specifically, the number of hidden layers (1 to 5), NodeDensity (1 to 20), and architecture learning (details of the results are omitted). In a nutshell, the performance peaks around NodeDensity = 10 with architecture learning. With NodeDensity > 10, the matching model has over $10^8$ parameters and needs regularization (e.g., dropout) in addition to early stopping to prevent overfitting. The influence of architecture learning is most salient for a relatively large NodeDensity (say, > 3), while the number of hidden layers stops bringing significant improvement beyond 3. Generally, we found that architectures deeper and larger than the current one do not bring any significant improvement but are much slower.

Comparison to Competitor Models 

Table 1 compares DEEPMATCH_tree to the competitor models. As it shows, our model outperforms all the competitor models by large margins. The contribution of deep architectures is manifested by the difference between the deep architecture and the shallow one (LR_tree) using the same mined patterns.


Model              P@1 (1v1)   P@1 (1v9)
COSSIM             0.554       0.377
DEEPMATCH_topic    0.701       0.330
WORDEMBED          0.774       0.370
TRANSLATION        0.819       0.586
DEEPMATCH_CNN      0.851       0.496
LR_tree            0.853       0.652
DEEPMATCH_tree     0.889       0.708


Table 1: The results of all models on Original-vs-Random. DEEPMATCH_tree significantly outperforms all the baselines (p < 0.01 from t-test).


There is a vast gap between pattern-based models and embedding-based models. Although the embedding-based methods can perform fairly well in the one-versus-one (1v1) setting (0.85+), their performance drops dramatically in the one-versus-nine (1v9) setting (0.49+), while the pattern-based methods can maintain over 0.55 (dropping from 0.80+) in the same test setting. This contrast suggests that pattern-based methods, with varying coverage in the feature space, are more certain on "matched" positive cases than on negative cases, yielding more reliable ranking results.


5.4 Results on Conversation Data 

For each model, we use 5-fold cross-validation to choose the hyper-parameter of the ranking model RankSVM and report the best result. Clearly, DEEPMATCH_tree can greatly improve the performance of retrieving a suitable response from the pool, with significantly better accuracy than the competitor models (p < 0.05 from t-test). This result is consistent with the result on Original-vs-Random despite the difference in experimental setting.


Model               P@1
Baseline            0.574
+DEEPMATCH_topic    0.587
+WORDEMBED          0.579
+TRANSLATION        0.585
+DEEPMATCH_tree     0.608


Table 2: The results on retrieval-based conversation. 


5.5 Analysis and Case Study 


Deep vs. Shallow Patterns. Deep patterns represent information that cannot be adequately modeled by shallow patterns in a deep neural network. Indeed, our study shows that on Original-vs-Random data, P@1 decreases to 0.871 (1v1) and 0.688 (1v9) after removing the deep features. This observation is interesting, since feature learning has often been regarded as partly the responsibility of deep learning. Below is a real case from our experiment.

T1:   Sigh, have to work this weekend
R1_A: You should rest more
R1_B: It is hard to find a job now, better prepare your resume


When trying to find a matched response for T1, the "deep" pattern {work → weekend ⊗ rest} plays a determining role in picking R1_A over R1_B, while shallower patterns such as {work ⊗ job} and {work ⊗ resume} favor R1_B.


The Effect of Abstraction. The abstraction step helps improve the generalization ability of the matching model, improving P@1 on Original-vs-Random from 0.876 to 0.889 (1v1) and from 0.694 to 0.708 (1v9). This can also be illustrated with the following real example from our experiment.

T2:   Hope Mavericks can win the game tomorrow
R2_A: Didn't know you root for Mavericks too
R2_B: Go! Brothers























Suppose that for T2 we want to pick the more appropriate response from the candidates R2_A and R2_B. The mining algorithm (Algorithm 1) discovers the following pattern after the SAMEENTITY abstraction:

hope → win → X ⊗ support → X

where X stands for any named entity. This pattern (and its own sub-patterns) then plays an important role in the later matching model in assigning a higher matching score to (T2, R2_A), covering more specific patterns like

hope → win → Mavericks ⊗ support → Mavericks

which are filtered out in the mining step because of their small support.

6 Related Work 

The proposed model is related to several threads of work in 
natural language processing and machine learning. 


Deep Matching Models. There are other works on using deep neural networks for the matching task [Huang et al., 2013; Bordes et al., 2014; Sun et al., 2013; Lu and Li, 2013; Hu et al., 2014], which build upon given or learned representations of objects. In our model, we try to directly mine and learn the representations of matching.


Graph-based Kernel. DEEPMATCH_tree extends the important notions in conventional graph kernels [Vishwanathan et al., 2010] in two senses. First, our model allows matching of two different subgraphs in two domains (e.g., {work → weekend} in one domain and {have → rest} in the other), while graph kernels only consider the common subgraphs on the two sides. Second, our model captures the nonlinear and hierarchical relations between different matching patterns, while graph kernels simply add them together, with different weights determined by the types of sub-graphs.


String-Rewriting Kernel. DEEPMATCH_tree is also related to the string-rewriting kernel (SRK) [Bu et al., 2013] for paraphrase identification, in that SRK also generates many patterns of matching and learns to weigh them in training. The main difference is that the matching patterns considered in SRK are exhaustively enumerated (although calculated in a smart way), while ours are discovered via a mining algorithm.


7 Conclusion 

We propose a generic model for matching two short-texts, which relies on a tree-mining algorithm to discover a vast number of matching patterns and a DNN to perform the matching task using those patterns. An empirical study on the rather difficult task of matching tweets and responses shows that our model can outperform competitors by large margins.